First, get all depencies and packages necessary to complete all tasks.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("GEOmetadb", quietly = TRUE))
BiocManager::install("GEOmetadb")
if (!requireNamespace("kableExtra", quietly = TRUE))
BiocManager::install("kableExtra")
if (!requireNamespace("gridExtra", quietly = TRUE))
install.packages("gridExtra")
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE))
BiocManager::install("org.Hs.eg.db")
if (!requireNamespace("knitr", quietly = TRUE))
install.packages("knitr")
if (!requireNamespace("biomaRt", quietly = TRUE))
BiocManager::install("biomaRt")
if (!requireNamespace("edgeR", quietly = TRUE))
BiocManager::install("edgeR")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap")
if (!requireNamespace("circlize", quietly = TRUE))
BiocManager::install("circlize")
if (!requireNamespace("limma", quietly = TRUE))
BiocManager::install("limma")
library(edgeR)
library(limma)
library(circlize)
library(ComplexHeatmap)
library(gridExtra)
library(knitr)
library(kableExtra)
library(GEOmetadb)
library(biomaRt)
library(org.Hs.eg.db)
The Expression Dataset I chose was GSE66306: Impact of bariatric surgery on RNA-seq gene expression profiles of peripheral monocytes in humans(Poitou et al. 2015).
Summary
Genome expression profiles were taken from obese women before, and three months after bariatric surgery.
Control and Test Conditions
The conditions that were tested were:
Before bariatric surgery (T0)
3 months after bariatric surgery (T3)
Why I was Interested
I have always been interested in health, and maintaining a healthy lifestyle; I was a competitive gymnast until I came to university. I always knew being healthy (whether that means eating healthily or being active / working out) had amazing benefits on your physical, mental, and emotional health. So a study that focused on health, and how a procedure like bariatric surgery can improve someone’s physical health, was of high interest to me.
Downloading the Dataset and initial previewing
Now that we have looked into information about the dataset, let’s download my dataset using BiocManager(Morgan 2019) and GEOmetadb(Zhu et al. 2008)!
sfiles = getGEOSuppFiles('GSE66306')
fnames = rownames(sfiles)
PM_exp = read.delim(fnames[1],header=TRUE,
check.names = FALSE, stringsAsFactors = FALSE)
Let’s first find out the dimensions of my dataset: 23354, 40
This indicates that there are 23354 rows and 40 columns!
Let’s take a quick look at what the first few rows and columns of my data looks like:
| Gene Name | Ensembl Gene ID | PM01_T0 | PM01_T3 | PM02_T0 | PM02_T3 |
|---|---|---|---|---|---|
| 1/2-SBSRNA4 | NA | 62.71930 | 62.29415 | 66.85663 | 41.28218 |
| A1BG | ENSG00000121410 | 97.01571 | 87.47226 | 67.36065 | 83.48856 |
| A1BG-AS1 | ENSG00000268895 | 63.30869 | 69.03770 | 62.58948 | 26.72385 |
| A1CF | ENSG00000148584 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| A2LD1 | NA | 174.01275 | 155.20349 | 71.78328 | 90.46922 |
A quick summary of what I observed about my dataset:
There are 23354 genes
There are gene names, Ensemble gene IDs, and 12 different test cases (two situations per patient, per gene)
Not all genes have Ensembl gene IDs
The gene names used are either the HUGO approved symbol or an alias symbol
Organize the dataset into patient IDs and cell types
Before doing further analysis of the dataset, I first want to create a table that lists all patients and easily displays the patient ID as well as the specific cell type analyzed.
| patients | time | |
|---|---|---|
| PM01_T0 | PM01 | T0 |
| PM01_T3 | PM01 | T3 |
| PM02_T0 | PM02 | T0 |
| PM02_T3 | PM02 | T3 |
| PM05_T0 | PM05 | T0 |
| PM05_T3 | PM05 | T3 |
| PM06_T0 | PM06 | T0 |
| PM06_T3 | PM06 | T3 |
| PM08_T0 | PM08 | T0 |
| PM08_T3 | PM08 | T3 |
Filter weakly expressed features from my dataset
Now, back to my dataset. I want to filter out weakly expressed features, using edgeR:
The filtered dimesions of the dataset now are: 13083, 40.
This means that 10271 genes were removed. That means there were 10271 outliers.
Edit the HUGO gene symbols and Ensembl Gene IDs
As mentioned above, some of the genes are missing Ensembl gene IDs. This is a large issue and I had lots of difficulty trying to salvage as many genes as I could that were missing the Ensembl gene IDs.
First, I tried to separate the genes that were missing ensembl gene IDs from the other genes:
na_gene_ids <- PM_exp_filtered[which(is.na(PM_exp_filtered$`Ensembl Gene ID`)), 1]
There are 1064 genes without ensembl gene ids!
I also read in the paper that they used hg19 instead of the most recent ensembl. Therefore, after some google searching, I came across this article that states that hg19 is equivalent to Ensembl’s GRCh37. As we were shown how to use Ensembl, I went with GRCh37 for all future queries.
Method 1: Match the gene names given in dataset to Ensembl IDs
In my dataset I was given gene ids - some of these were the same as HUGO symbols, while some were aliases or older symbols. I tried to use these to find the associated ensembl gene ids:
grch37 = useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice") # From https://support.bioconductor.org/p/62064/
ensembl_grch37 = useDataset("hsapiens_gene_ensembl",mart=grch37)
is_na <- getBM(attributes = c("wikigene_name", "ensembl_gene_id"),
filters = c("wikigene_name"),
values = na_gene_ids,
mart = ensembl_grch37)
I was fortunate to find 317 of my ensembl ids. I put them back into the dataset by:
for (i in 1:nrow(is_na)) {
gene_name <- is_na[i,]$wikigene_name
ensembl_gene_id <- is_na[i,]$ensembl_gene_id
index <- which(PM_exp_filtered$`Gene Name` == gene_name)
PM_exp_filtered[index,2] <- ensembl_gene_id
}
still_na <- na_gene_ids[which(!na_gene_ids %in% is_na$wikigene_name)] #remove all now identified gene names
Now, I am missing 808 ensembl ids.
Method 2: Use entrez gene ids on genes that begin with LOC
In my dataset there are quite a few genes that begin with the letters LOC. Dr. Isserlin suggested that if the LOC is removed, these ids can be used as entrez gene ids! I then separated all gene ids that began with LOC and performed a query to use the numbers from the gene ids (that began with LOC) to find matching ensembl gene ids:
LOC_indexes <- grep("^LOC", still_na) # Find all gene names beginning with LOC
LOC_names <- still_na[LOC_indexes]
no_LOC_names <- gsub("^LOC", "", LOC_names) # Remove all of the LOC from every gene name beginning with LOC
length(no_LOC_names) #362
## [1] 362
LOC_grch37 <- getBM(attributes = c("entrezgene_id", "ensembl_gene_id"),
filters = c("entrezgene_id"),
values = no_LOC_names,
mart = ensembl_grch37)
I was able to find 245. Now I will put them back into my dataset by:
for (i in 1:nrow(LOC_grch37)) {
gene_name <- paste0("LOC", toString(LOC_grch37[i,]$entrezgene_id)) # Add back LOC that they will match with gene names
ensembl_gene_id <- is_na[i,]$ensembl_gene_id
index <- which(PM_exp_filtered$`Gene Name` == gene_name)
PM_exp_filtered[index,2] <- ensembl_gene_id
}
left_LOC_na <- no_LOC_names[which(!no_LOC_names %in% LOC_grch37$entrezgene_id)] # Find all gene names that start with LOC
left_na <- still_na[-LOC_indexes] # Remove all gene names that start with LOC from the <NA> list
I am still left with 446 to attempt to find the ensembl gene ids for. I removed all ids that began with LOC from the list of indices I have left to check as that was the only check that would work in finding ensembl gene ids for genes beginning with LOC.
Method 3: Use list of known aliases to match with dataset gene names
When I was trying to find a solution to my missing ensembl ids, I came across this website and decided to use this as well! I will try and find proper gene names that map to my dataset’s gene names, and use those to find ensembl gene ids.
# To get the list of gene names and aliases
dbCon <- org.Hs.eg_dbconn()
sqlQuery <- 'SELECT * FROM alias, gene_info WHERE alias._id == gene_info._id;'
aliasSymbol <- dbGetQuery(dbCon, sqlQuery)
m <- matrix(ncol=2, byrow=TRUE)
colnames(m) <- c('old_symbol', 'new_symbol') # Old symbol is our gene name, new symbol is matching gene name
all_new_symbols <- c()
for (val in left_na) {
if (val %in% aliasSymbol$alias_symbol) {
index <- which(aliasSymbol$alias_symbol == val)
proper_symbol <- aliasSymbol[index,]$symbol[1]
m <- rbind(m, c(val, proper_symbol)) #to form association b/w the two
all_new_symbols <- c(all_new_symbols, proper_symbol) #for next step, to match ensembl gene ids with
}
}
# Get the ensembl gene ids that map to the new gene names
ensembl_w_new_names <- getBM(attributes = c("wikigene_name", "ensembl_gene_id"),
filters = c("wikigene_name"),
values = all_new_symbols,
mart = ensembl_grch37)
# Now, put all of the ensembl gene IDs into the chart
for (i in 1:nrow(ensembl_w_new_names)) {
gene_name <- ensembl_w_new_names[i,]$wikigene_name # The new name we matched with our gene names
old_gene_name <- m[which(m[,2] == gene_name)] # Gene names in our dataset
ensembl_gene_id <- ensembl_w_new_names[i,]$ensembl_gene_id
index <- which(PM_exp_filtered$`Gene Name` == old_gene_name)
PM_exp_filtered[index,2] <- ensembl_gene_id
}
## Warning in PM_exp_filtered$`Gene Name` == old_gene_name: longer object length is
## not a multiple of shorter object length
This is the last method I could find. Even though there are still some genes that are missing ensembl ids, I will leave them in my dataset as they do have some form of identification, though the gene ids used may be aliases or older hugo symbols.
Finally, to actually find the HUGO symbols that map to all of thse ensembl gene ids and add them to the dataset:
# Find the HUGO symbols
all_HUGO <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol"),
filters = c("ensembl_gene_id"),
values = PM_exp_filtered$`Ensembl Gene ID`,
mart = ensembl_grch37)
PM_exp_filtered$"HUGO_symbol" <- NA # Add HUGO column to dataset
# Put hugo symbols into the dataset
for (i in 1:nrow(all_HUGO)) {
ensembl_num <- all_HUGO[i,]$ensembl_gene_id
hugo_sym <- all_HUGO[i,]$hgnc_symbol
index <- which(PM_exp_filtered$`Ensembl Gene ID` == ensembl_num)
PM_exp_filtered$"HUGO_symbol"[index] <- hugo_sym
}
Now, it is time to check for duplicates!
PM_table <- data.frame(table(PM_exp_filtered$`Ensembl Gene ID`))
all_duplicates <- PM_exp_filtered[PM_exp_filtered$`Ensembl Gene ID` %in% PM_table$Var1[PM_table$Freq > 1],] #check which ensembl ids have a higher frequency than 1, meaning they are duplicated
length(all_duplicates$`Gene Name`) #476
## [1] 476
I can see that my dataset has 476 duplicates! I want to see which of my genes are duplicates:
gene_duplicates <- all_duplicates$`Gene Name`
all_duplicates$`Gene Name`
## [1] "ACBD6" "ACPL2" "AGAP5" "AGAP6"
## [5] "AGAP7" "AGAP8" "AGAP9" "ARMCX5-GPRASP2"
## [9] "ASB3" "ATP1A1OS" "AZI1" "BMS1P5"
## [13] "C10orf118" "C11orf35" "C11orf82" "C12orf23"
## [17] "C12orf44" "C17orf103" "C17orf72" "C19orf55"
## [21] "C19orf59" "C1QTNF6" "C1R" "C1orf115"
## [25] "C1orf162" "C1orf204" "C1orf216" "C1orf233"
## [29] "C1orf27" "C1orf85" "C20orf112" "C20orf201"
## [33] "C22orf26" "C2orf62" "C3orf17" "C3orf18"
## [37] "C3orf37" "C3orf38" "C3orf58" "C3orf62"
## [41] "C4orf21" "C5orf54" "C8orf37" "C8orf44"
## [45] "C8orf44-SGK3" "C8orf45" "C8orf46" "C8orf58"
## [49] "C8orf59" "C8orf76" "C8orf80" "C8orf82"
## [53] "CCDC144B" "CCDC144C" "CXXC1" "CXXC5"
## [57] "CXorf26" "CXorf36" "CXorf56" "CXorf65"
## [61] "CXorf69" "ECRP" "EIF4A2" "FAM105B"
## [65] "FAM211A" "FAM211B" "FAM21A" "FAM21B"
## [69] "FAM21C" "FLJ13197" "FLJ31813" "FLJ33630"
## [73] "FLJ41200" "FLJ45513" "GIMAP1-GIMAP5" "GIMAP5"
## [77] "GOLGA6L10" "GOLGA6L9" "GPR75-ASB3" "GTF2H2C"
## [81] "GTF2H2D" "HIST2H4A" "HIST2H4B" "HOXA9"
## [85] "IL8" "IPW" "ITFG2" "KGFLP2"
## [89] "KIAA0947" "KIAA1009" "KIAA1737" "KIAA1804"
## [93] "LIMS1" "LIMS3L" "LINC00341" "LOC100009676"
## [97] "LOC100128071" "LOC100128191" "LOC100128338" "LOC100128531"
## [101] "LOC100128682" "LOC100128788" "LOC100129196" "LOC100129250"
## [105] "LOC100129269" "LOC100129387" "LOC100129931" "LOC100129961"
## [109] "LOC100130557" "LOC100130581" "LOC100130691" "LOC100130855"
## [113] "LOC100131067" "LOC100131089" "LOC100131096" "LOC100131193"
## [117] "LOC100131434" "LOC100131691" "LOC100131733" "LOC100132077"
## [121] "LOC100132247" "LOC100132273" "LOC100132526" "LOC100132832"
## [125] "LOC100133445" "LOC100133612" "LOC100133991" "LOC100134229"
## [129] "LOC100134368" "LOC100134713" "LOC100190938" "LOC100233156"
## [133] "LOC100233209" "LOC100268168" "LOC100270804" "LOC100271836"
## [137] "LOC100272216" "LOC100272228" "LOC100286844" "LOC100287015"
## [141] "LOC100287042" "LOC100287177" "LOC100287225" "LOC100287314"
## [145] "LOC100287559" "LOC100287722" "LOC100287792" "LOC100288069"
## [149] "LOC100288123" "LOC100288198" "LOC100288432" "LOC100288637"
## [153] "LOC100288748" "LOC100288778" "LOC100288846" "LOC100289019"
## [157] "LOC100289473" "LOC100289561" "LOC100293534" "LOC100294362"
## [161] "LOC100302401" "LOC100306951" "LOC100329109" "LOC100335030"
## [165] "LOC100379224" "LOC100499177" "LOC100499484" "LOC100505495"
## [169] "LOC100505576" "LOC100505648" "LOC100505666" "LOC100505676"
## [173] "LOC100505678" "LOC100505702" "LOC100505715" "LOC100505761"
## [177] "LOC100505776" "LOC100505783" "LOC100505812" "LOC100505854"
## [181] "LOC100505865" "LOC100505989" "LOC100506046" "LOC100506054"
## [185] "LOC100506083" "LOC100506085" "LOC100506100" "LOC100506123"
## [189] "LOC100506124" "LOC100506497" "LOC100506688" "LOC100506710"
## [193] "LOC100506714" "LOC100506746" "LOC100506844" "LOC100506866"
## [197] "LOC100506930" "LOC100506963" "LOC100506994" "LOC100507032"
## [201] "LOC100507034" "LOC100507062" "LOC100507217" "LOC100507246"
## [205] "LOC100507373" "LOC100507392" "LOC100507404" "LOC100507423"
## [209] "LOC100507424" "LOC100507463" "LOC100507495" "LOC100507501"
## [213] "LOC100507557" "LOC100507567" "LOC100507577" "LOC100507632"
## [217] "LOC100509894" "LOC100527964" "LOC100616668" "LOC100630923"
## [221] "LOC100652999" "LOC100861402" "LOC144571" "LOC145783"
## [225] "LOC145820" "LOC147670" "LOC147727" "LOC147804"
## [229] "LOC148189" "LOC149837" "LOC150776" "LOC151174"
## [233] "LOC151475" "LOC151534" "LOC152217" "LOC152225"
## [237] "LOC153684" "LOC154761" "LOC155060" "LOC158257"
## [241] "LOC158572" "LOC200772" "LOC202181" "LOC202781"
## [245] "LOC219347" "LOC220729" "LOC220906" "LOC221442"
## [249] "LOC253039" "LOC254100" "LOC254128" "LOC255512"
## [253] "LOC257396" "LOC283089" "LOC283143" "LOC283299"
## [257] "LOC283624" "LOC283663" "LOC283693" "LOC283888"
## [261] "LOC284023" "LOC284385" "LOC284440" "LOC284454"
## [265] "LOC284551" "LOC284751" "LOC284837" "LOC284865"
## [269] "LOC284950" "LOC285033" "LOC285074" "LOC285103"
## [273] "LOC285359" "LOC285540" "LOC285696" "LOC285954"
## [277] "LOC285965" "LOC285972" "LOC286059" "LOC286186"
## [281] "LOC286467" "LOC338758" "LOC338799" "LOC338817"
## [285] "LOC339166" "LOC339290" "LOC339524" "LOC339803"
## [289] "LOC339874" "LOC340037" "LOC340515" "LOC340544"
## [293] "LOC349196" "LOC374443" "LOC375190" "LOC386597"
## [297] "LOC387646" "LOC387647" "LOC388387" "LOC389634"
## [301] "LOC389641" "LOC400027" "LOC400604" "LOC400657"
## [305] "LOC400891" "LOC400958" "LOC401074" "LOC401093"
## [309] "LOC401320" "LOC401397" "LOC401588" "LOC439994"
## [313] "LOC440173" "LOC440288" "LOC440300" "LOC440600"
## [317] "LOC440944" "LOC440970" "LOC441666" "LOC442028"
## [321] "LOC541471" "LOC541473" "LOC550112" "LOC550643"
## [325] "LOC595101" "LOC613037" "LOC619207" "LOC641298"
## [329] "LOC641367" "LOC642361" "LOC642852" "LOC643406"
## [333] "LOC643529" "LOC643623" "LOC643733" "LOC643770"
## [337] "LOC643837" "LOC644172" "LOC644242" "LOC644714"
## [341] "LOC645166" "LOC645212" "LOC645513" "LOC645676"
## [345] "LOC645752" "LOC646214" "LOC646329" "LOC646762"
## [349] "LOC646851" "LOC648740" "LOC648987" "LOC652276"
## [353] "LOC653061" "LOC653160" "LOC653501" "LOC653712"
## [357] "LOC678655" "LOC728024" "LOC728190" "LOC728431"
## [361] "LOC728537" "LOC728554" "LOC728558" "LOC728606"
## [365] "LOC728730" "LOC728752" "LOC728855" "LOC729678"
## [369] "LOC729737" "LOC729799" "LOC730091" "LOC730102"
## [373] "LOC730227" "LOC731424" "LOC731789" "LOC92249"
## [377] "LOC93622" "LOC96610" "LONP2" "LSMD1"
## [381] "MATR3" "MGC21881" "MGC57346" "MIR1248"
## [385] "MIR3653" "MRP63" "MST4" "NAA38"
## [389] "NBPF11" "NBPF16" "NBPF24" "NS3BP"
## [393] "OSTC" "OSTCP1" "PAR-SN" "PARPBP"
## [397] "PARS2" "PIDD" "PLSCR3" "PMS2P5"
## [401] "PRH1-PRR4" "PRKRIP1" "PROSC" "PROSER1"
## [405] "PSIP1" "PSKH1" "SGK3" "SNHG4"
## [409] "SPDYE2" "SPDYE5" "TARS" "TARS2"
## [413] "THADA" "THAP1" "TMED1" "TMED10"
## [417] "TMEM184A" "TMEM184B" "TMEM5" "TMEM50A"
## [421] "TMEM86A" "TMEM86B" "TNFAIP8L3" "TNFRSF10A"
## [425] "TREML4" "TRERF1" "TRIM8" "TRIM9"
## [429] "TSG101" "TSGA10" "TXN" "TXN2"
## [433] "UBE2O" "UBE2Q1" "UGCG" "UGDH"
## [437] "UQCR10" "UQCR11" "UQCRQ" "URB1"
## [441] "WASH2P" "WASH3P" "WDR7" "WDR70"
## [445] "ZFP14" "ZFP161" "ZHX2" "ZHX3"
## [449] "ZNF169" "ZNF17" "ZNF189" "ZNF19"
## [453] "ZNF195" "ZNF197" "ZNF253" "ZNF254"
## [457] "ZNF26" "ZNF260" "ZNF281" "ZNF436"
## [461] "ZNF438" "ZNF500" "ZNF501" "ZNF543"
## [465] "ZNF544" "ZNF671" "ZNF672" "ZNF674"
## [469] "ZNF675" "ZNF736" "ZNF737" "ZNF768"
## [473] "ZNF77" "ZNHIT1" "ZNHIT2" "ZNRF3"
If I remember correctly, the gene names listed here look an awful lot like the list of gene names who origianlly were missing ensemble gene ids! Let’s check:
orig_missing_ensembl <- length(which(gene_duplicates %in% PM_exp$`Gene Name`[is.na(PM_exp$`Ensembl Gene ID`)]))
length(which(gene_duplicates %in% PM_exp$`Gene Name`[is.na(PM_exp$`Ensembl Gene ID`)])) #406
## [1] 406
Wow! Most of the genes that are duplicates originally had no ensembl gene ids! As these duplicates make up around 3% of my dataset, I am going to leave all of these values in. I don’t feel comfortable removing genes, especially when I am unsure of the fact that the genes that are duplicated are being mapped 100% correctly.
Before perfoming any normalization on my dataset, I just wanted to be able to visualize my data.
I chose to use a boxplot because I found it to be the easiest representation to view the data as it showed distributions of each sample’s (PM_#) values and lots of information about them in one plot (interquartile range, first and third quartiles, and outliers).
I also used a denstiy plot as it is similar to a histogram, but you are able to easily view the distribution of data over a continuous interval of patient’s expression of the genes.
Now that I have been able to get an overview of what my data looks like, I will normalize the data:
filtered_data_matrix <- as.matrix(PM_exp_filtered[,3:40])
rownames(filtered_data_matrix) <- PM_exp_filtered$`Ensembl Gene ID`
d = DGEList(counts=filtered_data_matrix, group=samples$time)
d = calcNormFactors(d)
normalized_counts <- cpm(d)
A few of the normalized factors can be displayed:
| group | lib.size | norm.factors | |
|---|---|---|---|
| PM01_T0 | T0 | 33572965 | 0.8029624 |
| PM01_T3 | T3 | 32940023 | 0.9024528 |
| PM02_T0 | T0 | 28698158 | 1.0369125 |
| PM02_T3 | T3 | 28868780 | 1.0366107 |
| PM05_T0 | T0 | 24130290 | 0.9819000 |
We can see that there will be minor modifications to the dataset, but these modifications will still have a slight impact (as seen from the norm.factors column).
From this plot, I can automatically see that all patients, before and after surgery had very similar interquartile ranges, with a mean around 4. There seemed to be quite a few outliers, many on the more negative side, indicating much lower expression occurred slightly more frequently than very high expression.
The differences between pre-normalization and post-normalization are very minimal, especially in this graph. The different lines indiciating different patients are tigher together, however the mean has not shifted much. It seems that most pateints gene expression hovers around 0.18, with PM02_T3 dipping slightly lower at 0.15.
Now we have caught up with how the dataset was normalized and cleaned, and have a copy of it for further analysis!
kable(PM_exp_filtered[1:10,1:10], format = "html")
| Gene Name | Ensembl Gene ID | PM01_T0 | PM01_T3 | PM02_T0 | PM02_T3 | PM05_T0 | PM05_T3 | PM06_T0 | PM06_T3 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1/2-SBSRNA4 | ENSG00000247950 | 62.71930 | 62.29415 | 66.85663 | 41.28218 | 50.56757 | 22.50039 | 33.88546 | 25.70598 |
| 2 | A1BG | ENSG00000121410 | 97.01571 | 87.47226 | 67.36065 | 83.48856 | 186.39012 | 165.66641 | 130.99126 | 136.03277 |
| 3 | A1BG-AS1 | ENSG00000268895 | 63.30869 | 69.03770 | 62.58948 | 26.72385 | 118.38336 | 118.66857 | 64.01149 | 68.88014 |
| 5 | A2LD1 | ENSG00000134864 | 174.01275 | 155.20349 | 71.78328 | 90.46922 | 118.88036 | 117.50502 | 102.13343 | 111.10433 |
| 6 | A2M | ENSG00000175899 | 19.36841 | 16.66238 | 37.17075 | 18.27523 | 15.14377 | 37.73556 | 32.69034 | 26.84309 |
| 13 | AAAS | ENSG00000094914 | 595.53115 | 596.03090 | 494.07056 | 567.92197 | 476.07756 | 473.95374 | 488.67144 | 430.47829 |
| 14 | AACS | ENSG00000081760 | 174.93491 | 220.07247 | 213.37203 | 195.32551 | 218.75090 | 229.04913 | 174.27876 | 224.90844 |
| 21 | AAGAB | ENSG00000103591 | 932.06316 | 939.71988 | 994.14134 | 1006.52816 | 714.96474 | 747.77940 | 838.74404 | 723.04105 |
| 22 | AAK1 | ENSG00000115977 | 1318.43352 | 2106.26480 | 2545.63080 | 2495.16544 | 1405.69820 | 1428.93760 | 1780.92956 | 1493.12287 |
| 23 | AAMP | ENSG00000127837 | 2110.10878 | 2056.08450 | 1639.15665 | 1687.72253 | 1380.47480 | 1649.38669 | 1505.34146 | 1429.78255 |
Now I will begin analysis of gene expression from my dataset.
Before performing any analysis, I will be begin with plotting an MDS graph to analyze the culstering between patients before the surgery and three hours after the surgery.
# Make certain modifications to the dataset to remove issues of errors later in the report
#Only incude the genes whose ensembl gene ids are not NULL
PM_exp_filtered <- PM_exp_filtered[-which(is.na(PM_exp_filtered$`Ensembl Gene ID`)),]
#Get rid of all duplicates as well
PM_exp_filtered <- PM_exp_filtered[-which(PM_exp_filtered$`Gene Name` %in% all_duplicates$`Gene Name`),]
heatmap_matrix <- PM_exp_filtered[,4:ncol(PM_exp_filtered)-1]
rownames(heatmap_matrix) <- PM_exp_filtered$`Ensembl Gene ID`
colnames(heatmap_matrix) <- colnames(PM_exp_filtered[,4:ncol(PM_exp_filtered)-1])
plotMDS(heatmap_matrix, col = rep(c("darkgreen","blue"),10))
As shown in the graph, there is no two clusterings of patients before (T0) surgery and after (T3) surgery. It seems that most patients cluster just after 0 on the x-axis, for both results of before or after surgery. This indicates that my dataset is not the most ideal dataset, as in an ideal dataset the two treatment cases (T0 and T3) are in two very separate clusters. This may just add noise to my results later on in this process.
To have a further look at my dataset, I will plot an MDS graph and analyze the clustering between each individual patient before and after surgery.
pat_colors <- rainbow(10)
pat_colors <- unlist(lapply(pat_colors,FUN=function(x){rep(x,2)}))
plotMDS(heatmap_matrix, col = pat_colors )
As shown in the graph, there is a slight clustering within patients before (T0) and after surgery (T3). This is not ideal within a dataset, since patients should not cluster around eachother in the treatment and control cases (T0 and T3). Again, this indicates that noise will be present and may be an issue later on in the gene expression analysis.
Now, I will begin gene expression analysis of my dataset.
First, I will create a linear model using Limma(Ritchie et al. 2015).
model_design <- model.matrix(~ samples$time)
kable(model_design, type="html")
| (Intercept) | samples$timeT3 |
|---|---|
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
Next, I will fit my dataset to the linear model by applying empirical Bayes, which uses my data to specify the baseline, or set a prior observation. Then, I will create a table in which the p-values and adjusted p-values for gene expression are calculated.
expressionMatrix <- as.matrix(PM_exp_filtered[,3:40])
rownames(expressionMatrix) <- PM_exp_filtered$`Ensembl Gene ID`
colnames(expressionMatrix) <- colnames(PM_exp_filtered)[3:40]
minimalSet <- ExpressionSet(assayData=expressionMatrix)
fit <- lmFit(minimalSet, model_design)
fit2 <- eBayes(fit,trend=TRUE)
topfit <- topTable(fit2,
coef=ncol(model_design),
adjust.method = "BH",
number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits <- merge(PM_exp_filtered[,c(2,41)],
topfit,
by.y=0,by.x=1,
all.y=TRUE)
#sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
#view hits
kable(output_hits[1:10,],type="html")
| Ensembl Gene ID | HUGO_symbol | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|---|
| 10300 | ENSG00000196935 | SRGAP1 | 20.69417 | 45.12257 | 3.860254 | 0.0004222 | 0.9999188 | -4.595103 |
| 6417 | ENSG00000153283 | CD96 | 25.67070 | 51.88529 | 3.720099 | 0.0006360 | 0.9999188 | -4.595104 |
| 11960 | ENSG00000269821 | KCNQ1OT1 | 60.93707 | 186.82364 | 3.338608 | 0.0018822 | 0.9999188 | -4.595107 |
| 276 | ENSG00000020426 | MNAT1 | -20.75998 | 78.76815 | -3.305797 | 0.0020617 | 0.9999188 | -4.595107 |
| 11638 | ENSG00000248905 | FMN1 | 154.66557 | 441.55675 | 3.243284 | 0.0024497 | 0.9999188 | -4.595107 |
| 6728 | ENSG00000157933 | SKI | 302.04315 | 2099.74290 | 3.191329 | 0.0028240 | 0.9999188 | -4.595108 |
| 2544 | ENSG00000109790 | KLHL5 | -298.23446 | 2083.81718 | -3.188445 | 0.0028463 | 0.9999188 | -4.595108 |
| 318 | ENSG00000026652 | AGPAT4 | 218.70614 | 977.10470 | 3.143107 | 0.0032195 | 0.9999188 | -4.595108 |
| 8274 | ENSG00000170264 | FAM161A | 25.32429 | 59.14808 | 3.082983 | 0.0037863 | 0.9999188 | -4.595109 |
| 9815 | ENSG00000185862 | EVI2B | -4826.44290 | 22858.95871 | -3.058255 | 0.0040457 | 0.9999188 | -4.595109 |
Next, I want to see how many of my genes had significant enough expression before, and after adjustment.
#How many gene pass the threshold p-value < 0.05?
length(which(output_hits$P.Value < 0.05))
## [1] 223
#How many genes pass correction?
length(which(output_hits$adj.P.Val < 0.05))
## [1] 0
It seems that there are 223 genes that are significant before adjustment, but none after. This indicates that futher processing must be completed to get p-values that have been adjusted that are still significant.
Multiple Hypothesis Testing
A method that can be used to improve the results of the adjusted p-values is multiple hypothesis testing! This helps to control for patient variability, by taking both the patients’ gene expression AND the number of patients into account, instead of just the patients’ gene expression.
First, the linear model needs to be updated so that it now takes into consideration the patients’ gene expression and the number of patients.
model_design_pat <- model.matrix(~ samples$patients + samples$time)
model_design_pat[1:10,1:5]
## (Intercept) samples$patientsPM02 samples$patientsPM05 samples$patientsPM06
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 1 0 0
## 4 1 1 0 0
## 5 1 0 1 0
## 6 1 0 1 0
## 7 1 0 0 1
## 8 1 0 0 1
## 9 1 0 0 0
## 10 1 0 0 0
## samples$patientsPM08
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 1
## 10 1
Then, we again fit the model using empirical Bayes.
fit_pat <- lmFit(minimalSet, model_design_pat)
fit2_pat <- eBayes(fit_pat,trend=TRUE)
topfit_pat <- topTable(fit2_pat,
coef=ncol(model_design_pat),
adjust.method = "BH",
number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits_pat <- merge(PM_exp_filtered[,c(2,41)],
topfit_pat,by.y=0,by.x=1,all.y=TRUE)
#sort by pvalue
output_hits_pat <- output_hits_pat[order(output_hits_pat$P.Value),]
kable(output_hits_pat[1:10,],type="html")
| Ensembl Gene ID | HUGO_symbol | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|---|
| 6152 | ENSG00000149428 | HYOU1 | 224.10735 | 1584.97806 | 5.214442 | 0.0000409 | 0.2939791 | -4.531650 |
| 10178 | ENSG00000196218 | RYR1 | 18.61091 | 128.37796 | 4.784806 | 0.0001103 | 0.2939791 | -4.536787 |
| 4801 | ENSG00000135540 | NHSL1 | 26.21227 | 101.38153 | 4.722608 | 0.0001274 | 0.2939791 | -4.537575 |
| 8501 | ENSG00000172037 | LAMB2 | 74.16278 | 366.66467 | 4.718775 | 0.0001286 | 0.2939791 | -4.537624 |
| 8102 | ENSG00000168890 | TMEM150A | 107.76304 | 528.90158 | 4.701590 | 0.0001338 | 0.2939791 | -4.537844 |
| 11546 | ENSG00000242732 | RGAG4 | 66.27998 | 313.35280 | 4.667872 | 0.0001448 | 0.2939791 | -4.538278 |
| 11960 | ENSG00000269821 | KCNQ1OT1 | 60.93707 | 186.82364 | 4.608272 | 0.0001664 | 0.2939791 | -4.539054 |
| 276 | ENSG00000020426 | MNAT1 | -20.75998 | 78.76815 | -4.490611 | 0.0002191 | 0.3023822 | -4.540616 |
| 11254 | ENSG00000227460, ENSG00000197283 | NA | 46.40860 | 282.45028 | 4.488754 | 0.0002201 | 0.3023822 | -4.540641 |
| 3401 | ENSG00000119574 | ZBTB45 | -39.56290 | 135.70651 | -4.418770 | 0.0002593 | 0.3078740 | -4.541590 |
Next, we look at the p-values again like we did previously.
#How many gene pass the threshold p-value < 0.05?
length(which(output_hits_pat$P.Value < 0.05))
## [1] 933
#How many genes pass correction?
length(which(output_hits_pat$adj.P.Val < 0.05))
## [1] 0
Unfortunately, the results are again slightly disappointing in that there are still no genes with adjusted p-values that have significant gene expression.
When I was using Limma, I could not get any significantly differentially expressed genes, before or after multiple hypothesis testing.
I then re-read my paper and noticed that they also performed gene differential analysis, but used a different package than Limma, as the data seemed to follow a negative binomal distribution. Though Limma works well on data that follows a linear distribution, it should not be used by my data as my data follows a negative binomial distribution instead. EdgeR(MD, DJ, and GK 2010) is another package that can be used for differential expression, and is great for data that follows a negative binomial distribution - just like mine!
I will confirm that my data follows a negative binomial distribution by plotting the mean-variance graph of my data:
filtered_data_matrix <- as.matrix(PM_exp_filtered[,3:40])
rownames(filtered_data_matrix) <- PM_exp_filtered$`Ensembl Gene ID`
d = DGEList(counts=filtered_data_matrix, group=samples$time)
d <- estimateDisp(d, model_design_pat)
plotMeanVar(d, show.raw.vars = TRUE, show.tagwise.vars=TRUE,
show.ave.raw.vars = TRUE,
NBline=TRUE,
show.binned.common.disp.vars = TRUE)
The blue line in the above graph is the negative binomial line. As you can see by the red X’s, my data follows the negative binomial distribution. Therefore, I can use the edgeR package on my dataset!
First, I will use the edgeR glmQLFTest to fit my dataset. I will fit my dataset using the multiple hypothesis test like I did with Limma, where I used both the number of patients and the pateints’ gene expression in an attempt to control for the patients’ variability.
#model_design_pat already is set to use both the number of patients and the patients' data
fit <- glmQLFit(d, model_design_pat)
qlf.pos_vs_neg <- glmQLFTest(fit, coef='samples$timeT3')
kable(topTags(qlf.pos_vs_neg), type="html")
|
|
|
|
Next, I will check to see if there are any genes with significant gene expression using this new model.
qlf_output_hits <- topTags(qlf.pos_vs_neg,sort.by = "PValue", n = nrow(PM_exp_filtered))
length(which(qlf_output_hits$table$PValue < 0.05))
## [1] 2911
length(which(qlf_output_hits$table$FDR < 0.05)) #FDR is adjusted p-values
## [1] 138
Fortunately, there are 138 genes with significant gene expression, even after adjustment!
Next, I will plot the data on a heatmap(Gu, Eils, and Schlesner 2016), and look at the differences between patients before surgery (T0) and after surgery (T3).
top_hits <- rownames(qlf_output_hits$table)[qlf_output_hits$table$FDR<0.05]
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
heatmap_matrix_tophits<- heatmap_matrix_tophits[,
c(grep(colnames(heatmap_matrix_tophits),pattern = "T0"), grep(colnames(heatmap_matrix_tophits),pattern = "T3"))]
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_dend = TRUE,
show_column_dend = FALSE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
)
current_heatmap
The heatmap indicates that there is no or very little clustering by conditions. The expression levels vary in both conditions to a large degree, though there seems to be slightly less gene expression before surgery (T0), indicated by a purple colour and there is slightly more gene expression after surgery (T3), indicated by a red colour.
Finally, I will use an MA plot to look at genes of interest in my dataset:
First, I will use a volcano plot to plot the upregulated genes; they can be seen as the orange circles on the graph.
plotVolcano <- cbind(qlf_output_hits$table$logFC, -log10(qlf_output_hits$table$FDR))
colnames(plotVolcano) <- c("logFC", "P-val")
upregulated <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC > 1
downregulated <- qlf_output_hits$table$FDR < 0.05 & qlf_output_hits$table$logFC < 0
point.col <- ifelse(upregulated, "orange", "black")
plot(plotVolcano, col = point.col)
Next, I will use a volcano plot to plot the downregulated genes; they can be seen as green circles on the graph.
point.col <- ifelse(downregulated, "green", "black")
plot(plotVolcano, col = point.col)
I am going to continue to use edgeR as it allowed me to find genes that passed correction. With edgeR, I will be finding the downregulated and upregulated genes from my dataset.
d <- calcNormFactors(d)
fit <- glmQLFit(d, model_design_pat)
qlf.pos_vs_neg <- glmQLFTest(fit, coef='samples$timeT3')
qlf_output_hits <- topTags(qlf.pos_vs_neg,sort.by = "PValue",
n = nrow(filtered_data_matrix))
length(which(qlf_output_hits$table$PValue < 0.05 & qlf_output_hits$table$logFC > 0))
## [1] 732
length(which(qlf_output_hits$table$PValue < 0.05 & qlf_output_hits$table$logFC < 0))
## [1] 709
I was able to find that there are 732 upregulated genes, and 709 downregulated genes! Now I will find the ensembl gene IDs of these genes and save them in text files, which I will put into the g:profiler(Raudvere et al. 2019) webpage so that I can see what datasets match those genes best.
qlf_output_hits_withgn <- merge(PM_exp_filtered[,1:2],qlf_output_hits, by.x=2, by.y = 0)
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) * sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]
#Get all of the gene sets
upregulated_genes <- qlf_output_hits_withgn$`Ensembl Gene ID`[which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC > 0)]
downregulated_genes <- qlf_output_hits_withgn$`Ensembl Gene ID`[which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC < 0)]
all_genes <- c(downregulated_genes, upregulated_genes)
#Save all gene sets to text files to be able to use in g:profiler
write.table(x=upregulated_genes,
file=file.path("data","PM_upregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downregulated_genes,
file=file.path("data","PM_downregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=all_genes,
file=file.path("data","PM_all_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
Now that I have text files with all of the necessary genes, I will use these text files in g:profiler. The results can be found in images below.
The annotation data sets that I used were GO biological process, Reactome, and WikiPathways. There were no results for WikiPathways, so this annotation data set results will not be shown below. I used these annotation data sets because I am familiar with how they all work from previous assignments, as well as they all have important strengths that can be leveraged. Reactome is useful in identifying full molecular details of a pathway, while GO is useful to learn more about the outcome or ending state of molecular functions. I am using version e98_eg45_p14_ce5b097 of the annotation data sets.
I used the same thresholds from when I performed gene expression analysis; again I will be looking at FDR (Benjamini-Hochberg method) that are below 0.05, as this is the usual threshold that is set for other quantifiers such as p-values.
The amount of genesets returned from each annotation data set for each query performed are shown below.
Up-Regulation Results
Go biological pathways: 23 genesets
Reactome: 53 genesets
Down-Regulation Results
GO biological pathways: 91 genesets
Reactome: 3 genesets
Combined Up-Regulation and Down-Regulation Results
GO biological pathways: 101 genesets
Reactome: 14 genesets
Results for up-regulation are as follows:
The top term genesets returned have to do with response to heat and transporting mRNA.
Results for down-regulation are as follows:
The top term genesets returned have to do with building and breaking down cells, and actions of blood cells (oxygenating, deoxygenating, degranulation).
Results for both down-regulation and up-regulation are as follows:
The top term genesets returned have to do with SUMOylation of different proteins, and metabolic and catabolic processes.
Comparing GO results from all
It seems that all the results for GO invovled some form of regulation or organization of different parts or aspects of the cell. It seems that for the results where both down-regulation and up-regulation were included, the top term genesets returned were a combination of the results from down-regulation and up-regulation, as to be expected when combining the genes together.
Comparing Reactome results from all
It seems that the results for Reactome were again, a combination of both genesets returned from down-regulation and up-regulation. However, as opposed to GO, instead of having an almost equal spread of top term genesets between down and up-regulation, most of the top term genesets were from the up-regulation rather than the down-regulation.
In the paper I chose, they also performed gene expression analysis and thersholding, so I could compare my results with theirs.
The paper specifies that the upregulated genes were “mostly involved in the basal cellular machinery.” Basal cellular machinery has to do with forming the basic building blocks of the cell. This aligns with my results, as my results have to do with transporting mRNA, which is one of the basic building blocks to create cells.
The paper specifies that the down-regulated genes were “enriched in metabolic functions of adipose tissue”. This corresponds well with my results, since the results I found were genesets that broke and created cells, which are metabolic functions.
There are other papers who perform research to observe the effects after bariatric surgery on another’s body.
Another article(Santos et al. 2014) I found performed research on different aspects of patients’ physical and cellular heath before and three months after bariastric surgery. The results that they found were that the patients’ weight went down, and there were decreased neutrophil and triglyceride counts. The results that I found result were similar to the results of this paper as the genesets that were downregulated were actions of blood cells (like nutrophils), as well as breaking down cells like triglycerides. Also, my results indicate that response to heat is up-regulated, leading me to believe that the increased response to heat causes the patients to lose weight, just as the paper described.
There was also an article(Sparks et al. 2015) that reported findings on arthritic patients’ health before and after bariatric surgery. The results were similar to the article above, where the patients again lost weight, but in addition, in this article it was documented that there was a lower erythrocyte sedimentation rate. This again matches with my results, as a down-regulated gene was oxygentation and deoxygenation of blood cells, which means that erythrocyte sedimentation rate will decrease as a result since the erythrocytes ability to fall quickly.
Gu, Zuguang, Roland Eils, and Matthias Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics.
MD, Robinson, McCarthy DJ, and Smyth GK. 2010. “edgeR a Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (7): 139–40.
Morgan, Martin. 2019. BiocManager: Access the Bioconductor Project Package Repository. https://CRAN.R-project.org/package=BiocManager.
Poitou, Christine, Claire Perret, François Mathieu, Vinh Truong, Yuna Blum, Hervé Durand, Rohia Alili, et al. 2015. “Bariatric Surgery Induces Disruption in Inflammatory Signaling Pathways Mediated by Immune Cells in Adipose Tissue: A Rna-Seq Study.” PLoS One 10 (5). Public Library of Science.
Raudvere, Uku, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, and Jaak Vilo. 2019. “g:Profiler a Web Server for Functional Enrichment Analysis and Conversions of Gene Lists (2019 Update.” Nucleic Acids Research. https://doi.org/10.1093/nar/gkz369.
Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.
Santos, J, P Salgado, C Santos, P Mendes, J Saavedra, P Baldaque, L Monteiro, and E Costa. 2014. “Effect of Bariatric Surgery on Weight Loss, Inflammation, Iron Metabolism, and Lipid Profile.” Scandinavian Journal of Surgery 103 (1). SAGE Publications Sage UK: London, England: 21–25.
Sparks, Jeffrey A, Florencia Halperin, Jonathan C Karlson, Elizabeth W Karlson, and Bonnie L Bermas. 2015. “Impact of Bariatric Surgery on Patients with Rheumatoid Arthritis.” Arthritis Care & Research 67 (12). Wiley Online Library: 1619–26.
Zhu, Yuelin, Sean Davis, Robert Stephens, Paul S. Meltzer, and Yidong Chen. 2008. “GEOmetadb: Powerful Alternative Search Engine for the Gene Expression Omnibus.” Bioinformatics (Oxford, England) 24 (23): 2798–2800. https://doi.org/10.1093/bioinformatics/btn520.